Missing Data and Imputation 1

Concepts, Challenges, and Strategies to Address Missing Data

Erik Westlund

2025-11-19

Why Missing Data Matters

  • Maternal health studies often rely on observational cohorts, so incomplete data is common.
  • Ignoring missingness can erase power, bias estimates, or make truth unknowable.
  • Today: simulation-first walkthrough with simple DAGs and linear models.

Simulated Data: Seeing Is Believing

  • Simulated data lets us see firsthand how the concepts behind missing data strategies operate in practice, because we know the true data-generating process.
  • Simulations help build intuition for when missing data strategies succeed or fail.

Causal Thinking

  • Working with observational data does not mean we can skip causality; it makes it more important.
  • A defensible causal graph is needed to reason about missingness mechanisms.
  • Without it, multiple imputation can reinforce bias instead of fixing it.

Stylized Teaching Examples

  • The DAGs and simulations today are deliberately simple.
  • Real maternal-health data has more variables, feedback loops, and measurement issues.

Stylized Teaching Examples (cont.)

  • Treat these as “toy” examples to see how missingness mechanisms and fixes behave.
  • Next session: revisit with a more realistic, higher-dimensional DAG.

No Exit & Three Other Plays: Mechanisms

  • Everything is Perfect: Complete case analysis
  • Ignorable and Annoying: Missing Completely at Random (MCAR)
  • Dangerous but Manageable: Missing at Random (MAR)
  • No Exit: Missing Not at Random (MNAR)

Today’s Plan

  • Work through examples of MCAR and MAR to get an intuition for their implications
  • Examine common “fixes” to show how they fail and motivate why we need multiple imputation

Ground Truth

  • Nutrition effect on birthweight (\(β_X\)): 0.5.
  • Observed factor effect on birthweight (\(β_E\)): 1.
  • Standard deviation around effect size (error): 2.
  • Keep these in mind as we go; they are the “truth” for all our examples.
beta_x_true    <- 0.5   # effect of nutrition on birthweight
beta_e_true    <- 1     # effect of the observed factor (e.g., education) on birthweight
sd_error_true  <- 2     # residual SD

Simulation Helper (for reference)

simulate_birth_data <- function(n = 1000, beta_x = 0.5, beta_e = 1, sd_error = 2,
                                mcar_rate = 0.5, mar_logit_shift = 1.2,
                                mar_depend = c("E", "X")) {
  mar_depend <- match.arg(mar_depend)

  # Data-generating process: E -> X -> Y, E -> Y
  E <- rbinom(n, 1, 0.5)
  X <- 1.5 * E + rnorm(n)
  Y <- 2 + beta_x * X + beta_e * E + rnorm(n, sd = sd_error)

  # MCAR: constant missingness probability for every mother
  miss_mcar <- rbinom(n, 1, mcar_rate)
  X_mcar <- ifelse(miss_mcar == 1, NA_real_, X)

  # MAR: missingness depends on E (observed) or on X itself
  mar_var <- if (mar_depend == "E") 1 - E else -as.numeric(scale(X))

  # Solve for the intercept so the overall MAR missing rate matches mcar_rate
  alpha <- if (mar_logit_shift == 0) {
    qlogis(mcar_rate)
  } else {
    uniroot(
      f = function(a) mean(plogis(a + mar_logit_shift * mar_var)) - mcar_rate,
      interval = c(-15, 15)
    )$root
  }
  p_miss_mar <- plogis(alpha + mar_logit_shift * mar_var)
  miss_mar <- rbinom(n, 1, p_miss_mar)
  X_mar <- ifelse(miss_mar == 1, NA_real_, X)

  tibble::tibble(E, X, Y, X_mcar, miss_mcar, X_mar, miss_mar)
}

Base Causal Structure: E → Nutrition → Birthweight ← E

  • Nutrition (X) → Birthweight (Y)
  • Another variable E affects both
  • No missingness yet; this is the target data-generating process.
  • Full data model uses Y ~ X + E as our ‘truth’.

Base Relationship: What’s in E?

  • E stands in for any observed factor that affects both nutrition and birthweight.
  • Examples: education, clinic context, or barriers to care.
  • We’re using a simple version to make the mechanics visible.

Causal Graph

Simulate Data and Fit the Truth Model

sim_data <- simulate_birth_data(
  n         = 5000,
  beta_x    = beta_x_true,
  beta_e    = beta_e_true,
  sd_error  = sd_error_true
)

full_fit <- lm(Y ~ X + E, data = sim_data)

Simulated Data Model Results


Call:
lm(formula = Y ~ X + E, data = sim_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6783 -1.3513 -0.0298  1.3986  7.7126 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.98314    0.04027   49.24   <2e-16 ***
X            0.50024    0.02878   17.38   <2e-16 ***
E            1.00465    0.07198   13.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.022 on 4997 degrees of freedom
Multiple R-squared:  0.2006,    Adjusted R-squared:  0.2003 
F-statistic: 627.1 on 2 and 4997 DF,  p-value: < 2.2e-16

Partial Relationship: Residualized X vs Y (conditioning on E)

MCAR: Missing Completely At Random

MCAR: The Happy Case

  • Missingness is truly random.
  • For MCAR, we randomly delete nutrition (X) for a subset of mothers.
  • Random missingness has no relationship to education or nutrition values.

MCAR: Impact

  • Missing data does not cause biased estimates. It just reduces precision/power.
  • Example: random equipment failures, accidentally dropped samples.

MCAR: DAG

Simulation Code: Base Data and MCAR

# Ensure base data exists (5000 rows defined earlier)
if (!exists("sim_data")) {
  sim_data <- simulate_birth_data(
    n         = 5000,
    beta_x    = beta_x_true,
    beta_e    = beta_e_true,
    sd_error  = sd_error_true
  )
}

# Full data model (gold standard) on the complete data
full_fit <- lm(Y ~ X + E, data = sim_data)

# Randomly mark 50% of X as missing (MCAR) for this demo
mcar_data <- sim_data |>
  mutate(
    drop_flag = rbinom(n(), 1, 0.5),
    X_mcar = if_else(drop_flag == 1, NA_real_, X)
  )

MCAR with Listwise Deletion

# Only keep rows where X_mcar is observed (listwise deletion)
mcar_cc <- mcar_data |> filter(!is.na(X_mcar))
mcar_fit <- lm(Y ~ X_mcar + E, data = mcar_cc)

nrow(mcar_cc)
[1] 2522
summary(mcar_fit)

Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7022 -1.3534  0.0086  1.3901  7.7041 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.97448    0.05698   34.65   <2e-16 ***
X_mcar       0.47749    0.04102   11.64   <2e-16 ***
E            1.02528    0.10239   10.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.043 on 2519 degrees of freedom
Multiple R-squared:  0.1916,    Adjusted R-squared:  0.1909 
F-statistic: 298.5 on 2 and 2519 DF,  p-value: < 2.2e-16

Summary of Differences


Call:
lm(formula = Y ~ X_mcar + E, data = mcar_cc)

Coefficients:
(Intercept)       X_mcar            E  
     1.9745       0.4775       1.0253  
# A tibble: 2 × 3
  method            estimate     se
  <chr>                <dbl>  <dbl>
1 Full data (truth)    0.500 0.0288
2 MCAR listwise        0.477 0.0410

Power Analysis: MCAR vs. Complete Data

MCAR: estimates center on truth; uncertainty widens as more X is missing.

MCAR: Not a Big Deal But Rarely Plausible

  • MCAR is the happy case: just drop mothers with missing nutrition, move on.
  • You lose precision/power, but no bias.
  • Problem: MCAR is seldom plausible, except in a few cases (e.g., random measurement instrument failure)

MAR: Missing At Random

MAR: Nonresponse Driven by E and X

  • Missingness rises with an observed factor (e.g., education, clinic context) and when nutrition is low.
  • Education affects both nutrition and birthweight.
  • Dropping cases with missing values up-weights higher-education/higher-nutrition mothers, biasing the nutrition effect.
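The mechanism above can be sketched in a few lines of base R, mirroring the simulation helper shown earlier: missingness in \(X\) follows a logistic model in \(E\), so lower-education mothers lose their nutrition value more often (the intercept and shift values here are illustrative).

```r
# E-driven MAR sketch: p(X missing) depends only on the observed E
set.seed(42)
n <- 5000
E <- rbinom(n, 1, 0.5)          # observed factor (e.g., education)
X <- 1.5 * E + rnorm(n)         # nutrition depends on E

# Missingness is more likely when E = 0 (logit scale)
p_miss <- plogis(-1 + 1.2 * (1 - E))
X_mar  <- ifelse(rbinom(n, 1, p_miss) == 1, NA_real_, X)

# Missing rate differs sharply by E
tapply(is.na(X_mar), E, mean)
```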

MAR Scenario Set

We will look at:

  • A single “truth” dataset (n = 20,000) with no missingness
  • MAR cases with 10%, 30%, and 60% of X missing
  • For each level: one scenario where missingness is driven by E (observable) and one where missingness is driven by X itself (worst case)

DAG

MAR Scenarios, summarized

  • No bias in the Y or E summaries; only X is missing here
  • Bias increases as more data goes missing
  • Bias increases as the relationship between X, E, and p(missing) strengthens
  • The highest-bias scenario has a particularly strong link between the true value of X and p(X missing)
Scenario                 N      Y mean  Y sd  Y bias  X mean  X sd  X bias  E mean  E sd  E bias
Full data                20000  2.87    2.24  0       0.73    1.25  0.00    0.49    0.5   0
10% missing, E-driven    17898  2.87    2.24  0       0.81    1.25  0.08    0.49    0.5   0
10% missing, X-driven    18018  2.87    2.24  0       0.96    1.09  0.23    0.49    0.5   0
30% missing, E-driven    14020  2.87    2.24  0       0.99    1.23  0.26    0.49    0.5   0
30% missing, X-driven    13971  2.87    2.24  0       1.33    0.91  0.60    0.49    0.5   0
60% missing, E-driven    8121   2.87    2.24  0       1.29    1.12  0.56    0.49    0.5   0
60% missing, X-driven    8000   2.87    2.24  0       1.90    0.74  1.17    0.49    0.5   0

MAR Scenarios: Naive Regression (Y ~ X)

  • No adjustment for E; drop rows with missing X.
  • True \(β_X\) = 0.50
  • E affects Y, X, and p(X missing); relative to this naive model, the data are effectively MNAR and the model is mis-specified
  • MNAR and mis-specified models tend to go hand-in-hand.
Scenario                 N      β_X    SE(β_X)
Full data                20000  0.734  0.012
10% missing, E-driven    17898  0.734  0.012
10% missing, X-driven    18018  0.745  0.014
30% missing, E-driven    14020  0.710  0.014
30% missing, X-driven    13971  0.744  0.019
60% missing, E-driven    8121   0.626  0.020
60% missing, X-driven    8000   0.717  0.031

MAR Scenarios: Adjusted Regression (Y ~ X + E)

  • This model is correctly specified. “No backdoor paths.”
  • Observed scenarios drop cases with missing X.
  • True \(β_X\) = 0.50
  • Even without any imputation, the adjusted model recovers \(β_X\) reasonably well across scenarios.
  • The main consequence is power loss, not bias.
Scenario                 N      β_X    SE(β_X)  β_E    SE(β_E)
Full data                20000  0.497  0.014    0.988  0.035
10% missing, E-driven    17898  0.496  0.015    1.001  0.037
10% missing, X-driven    18018  0.493  0.017    0.985  0.036
30% missing, E-driven    14020  0.500  0.017    0.945  0.044
30% missing, X-driven    13971  0.501  0.021    0.998  0.040
60% missing, E-driven    8121   0.489  0.022    1.015  0.073
60% missing, X-driven    8000   0.560  0.032    0.886  0.059

When MAR Holds (and When It Doesn’t)

  • If we model the variables that drive missingness (here, X and E), listwise deletion analyses stay near the truth.
  • In that case, missing data mostly hurts precision: fewer cases translate to bigger SEs.
  • Trouble begins when key drivers are unmeasured or omitted (e.g., X-driven missingness with only Y ~ X): then MAR is violated relative to the analysis, and bias appears.
  • In real workflows we combine careful modeling with imputation; that combination performs best.

Common, Non-Ideal Imputation Tactics

  • We apply four historically common tactics to each MAR scenario (E-driven and X-driven at 10/30/60% missing) plus the full, no-missing baseline.
  • Every regression keeps \(E\) in the model (since we observe it); what changes is how we reconstruct \(X\).
  • Tables on the next slides show \(β_X\), standard errors, and how far each method drifts from the true 0.50.

Mean Imputation

  • Insert a single cohort-wide mean for every missing nutrition value.
  • Mechanics: compute the mean of observed \(X\), then plug that constant into each missing slot.
  • All imputed mothers receive the same nutrition score, so their points line up on the same vertical slice of the scatter.
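A minimal base-R sketch of this tactic (the helper name `mean_impute` and the small vector are just for illustration):

```r
# Mean imputation: every NA gets the observed mean, collapsing variation
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

X <- c(1.2, NA, 0.8, NA, 2.0)
X_imp <- mean_impute(X)
X_imp                            # both NAs replaced by mean(1.2, 0.8, 2.0)
sd(X_imp) < sd(X, na.rm = TRUE)  # imputation shrinks the SD of X
```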

Mean Imputation: Simulation Results

Scenario                 Missing  N (listwise)  β_X (listwise)  SE (listwise)  N      β_X    SE(β_X)  β_X bias  X mean  X sd
Full data                0%       20000         0.497           0.014          20000  0.497  0.014    -0.003    0.73    1.25
10% missing, E-driven    11%      17898         0.496           0.015          20000  0.470  0.015    -0.030    0.81    1.18
10% missing, X-driven    10%      18018         0.493           0.017          20000  0.418  0.016    -0.082    0.96    1.03
30% missing, E-driven    30%      14020         0.500           0.017          20000  0.410  0.016    -0.090    0.99    1.03
30% missing, X-driven    30%      13971         0.501           0.021          20000  0.373  0.020    -0.127    1.33    0.76
60% missing, E-driven    59%      8121          0.489           0.022          20000  0.407  0.021    -0.093    1.29    0.71
60% missing, X-driven    60%      8000          0.560           0.032          20000  0.421  0.032    -0.079    1.90    0.47

Mean Imputation vs. Listwise Deletion

Mean Imputation: Problems

  • Even with \(E\) in the regression, mean imputation is more biased than simple listwise deletion.
  • It shrinks the slope toward zero: the missing mothers have, on average, worse nutrition, yet we replace their values with a mean computed from cases where highly educated, well-nourished mothers are over-represented.
  • Standard errors are too small because we flooded the data with identical, non-varying values.

Mean + Indicator

  • Start with mean imputation, then append a binary “missing” indicator to the regression.
  • Mechanics: fill missing \(X\) with the cohort mean, create miss_ind = 1 when \(X\) was imputed, and fit \(Y \sim X + E + \text{miss\_ind}\).
  • Intuition: mothers with missing nutrition get the same slope as everyone else but a separate intercept shift governed by the indicator.
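A minimal sketch of the tactic in base R (the data here are freshly simulated for illustration; the numbers are not the deck's results, and the 30% missing rate is an assumption):

```r
# Mean imputation plus a missingness indicator
set.seed(7)
n <- 2000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
X_obs <- ifelse(rbinom(n, 1, 0.3) == 1, NA_real_, X)  # 30% missing

miss_ind <- as.integer(is.na(X_obs))                  # 1 = X was imputed
X_filled <- ifelse(miss_ind == 1, mean(X_obs, na.rm = TRUE), X_obs)

fit <- lm(Y ~ X_filled + E + miss_ind)                # indicator shifts the intercept only
coef(fit)
```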

Mean + Indicator: Simulation Results

Scenario                 Missing  N (listwise)  β_X (listwise)  SE (listwise)  N      β_X    SE(β_X)  β_X bias  X mean  X sd
Full data                0%       20000         0.497           0.014          20000  0.497  0.014    -0.003    0.73    1.25
10% missing, E-driven    11%      17898         0.496           0.015          20000  0.492  0.015    -0.008    0.81    1.18
10% missing, X-driven    10%      18018         0.493           0.017          20000  0.490  0.016    -0.010    0.96    1.03
30% missing, E-driven    30%      14020         0.500           0.017          20000  0.459  0.016    -0.041    0.99    1.03
30% missing, X-driven    30%      13971         0.501           0.021          20000  0.488  0.020    -0.012    1.33    0.76
60% missing, E-driven    59%      8121          0.489           0.022          20000  0.409  0.021    -0.091    1.29    0.71
60% missing, X-driven    60%      8000          0.560           0.032          20000  0.495  0.031    -0.005    1.90    0.47

Mean + Indicator vs. Listwise Deletion

Mean + Indicator: Problems

  • In this case, adding an indicator reduced bias considerably compared to plain mean imputation
  • It treats missingness as a fixed effect, so it can exacerbate bias on other coefficients (e.g., \(\beta_E\)) that are correlated with missingness

Hot-Deck Imputation: Mechanics

  • For each missing nutrition value, randomly draw a donor mother who reported \(X\) and copy her score.
  • Mechanics: sample with replacement from the observed \(X\) values, replace the NAs, and analyze \(Y \sim X + E\) on the filled-in dataset.
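A minimal base-R sketch (the helper name `hot_deck` and the toy vector are illustrative):

```r
# Hot-deck: fill each NA with a random draw from the observed donor pool
hot_deck <- function(x) {
  donors <- x[!is.na(x)]
  x[is.na(x)] <- sample(donors, sum(is.na(x)), replace = TRUE)
  x
}

set.seed(7)
X <- c(1.2, NA, 0.8, NA, 2.0, 0.5)
X_hd <- hot_deck(X)   # every imputed value is a copy of an observed one
```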

Hot-Deck Imputation: Simulation Results

Scenario                 Missing  N (listwise)  β_X (listwise)  SE (listwise)  N      β_X    SE(β_X)  β_X bias  X mean  X sd
Full data                0%       20000         0.497           0.014          20000  0.497  0.014    -0.003    0.73    1.25
10% missing, E-driven    11%      17898         0.496           0.015          20000  0.392  0.014    -0.108    0.81    1.25
10% missing, X-driven    10%      18018         0.493           0.017          20000  0.344  0.015    -0.156    0.96    1.09
30% missing, E-driven    30%      14020         0.500           0.017          20000  0.263  0.013    -0.237    0.99    1.22
30% missing, X-driven    30%      13971         0.501           0.021          20000  0.257  0.017    -0.243    1.33    0.91
60% missing, E-driven    59%      8121          0.489           0.022          20000  0.177  0.013    -0.323    1.30    1.12
60% missing, X-driven    60%      8000          0.560           0.032          20000  0.141  0.020    -0.359    1.90    0.74

Hot-Deck Imputation vs. Listwise Deletion

Hot-Deck Imputation: Problems

  • In this case, hot-deck performs worse than mean imputation, vastly exacerbating the bias from listwise deletion
  • It works when missingness is MCAR; once missingness depends on \(E\) or \(X\), the donor pool misrepresents the missing cases.
  • It still leads to overly tight SEs because draws are constrained to observed cases, which here come from a tighter distribution than the truth.

Regression Imputation: Mechanics (Single Driver)

  • Estimate a predictive model \(\widehat{X} = \widehat{\alpha} + \widehat{\gamma} E\) using only mothers with observed nutrition.
  • \(\widehat{\alpha}\) is the fitted intercept (baseline nutrition when \(E = 0\)); \(\widehat{\gamma}\) is the estimated effect of \(E\) on nutrition.
  • Mechanics: use the observed \(E\) to predict \(X\) for every missing case, replace the NA with that prediction, and then run the outcome regression.
  • In our simulation, low-education mothers receive imputations near \(\widehat{\alpha}\) while high-education mothers get \(\widehat{\alpha} + \widehat{\gamma}\)—mirroring the DAG structure instead of collapsing to a single mean.
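The mechanics above, sketched in base R (freshly simulated data for illustration; the 30% missing rate is an assumption):

```r
# Regression imputation: predict missing X from E using complete cases
set.seed(7)
n <- 2000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
X_obs <- ifelse(rbinom(n, 1, 0.3) == 1, NA_real_, X)

obs       <- !is.na(X_obs)
imp_model <- lm(X_obs ~ E)     # fit on complete cases (lm drops NAs)
X_reg     <- X_obs
X_reg[!obs] <- predict(imp_model, newdata = data.frame(E = E[!obs]))
```

Because \(E\) is binary here, every imputed value is one of just two numbers, \(\widehat{\alpha}\) or \(\widehat{\alpha} + \widehat{\gamma}\): the fill-in is deterministic given \(E\).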

Regression Imputation: Mechanics (Multiple Drivers)

  • When several observed variables \(Z_1, Z_2, \ldots, Z_p\) relate to nutrition or missingness, fit \(\widehat{X} = \widehat{\alpha} + \sum_{j=1}^p \widehat{\gamma}_j Z_j\) among mothers who reported \(X\).
  • Each coefficient \(\widehat{\gamma}_j\) shows how that covariate shifts expected nutrition; the imputed value is the linear predictor for each missing case.
  • In richer data, the \(Z_j\) could include clinic context, SES, prior visits, etc.—anything observed that plausibly drives both nutrition and missingness.

Model-based Imputation Techniques

  • With rich data sets, you do not have to limit your missing data model to covariates in your analysis model
  • The idea is to model each variable with the most informative model you have
  • This is similar to propensity score models, where you model the treatment mechanism to produce matched sets or weights to achieve balance

Regression Imputation: Simulation Results

Scenario                 Missing  N (listwise)  β_X (listwise)  SE (listwise)  N      β_X    SE(β_X)  β_X bias  X mean  X sd
Full data                0%       20000         0.497           0.014          20000  0.497  0.014    -0.003    0.73    1.25
10% missing, E-driven    11%      17898         0.496           0.015          20000  0.496  0.015    -0.004    0.73    1.21
10% missing, X-driven    10%      18018         0.493           0.017          20000  0.493  0.017    -0.007    0.90    1.05
30% missing, E-driven    30%      14020         0.500           0.017          20000  0.500  0.017    0.000     0.73    1.13
30% missing, X-driven    30%      13971         0.501           0.021          20000  0.501  0.021    0.001     1.20    0.81
60% missing, E-driven    59%      8121          0.489           0.022          20000  0.489  0.023    -0.011    0.74    0.98
60% missing, X-driven    60%      8000          0.560           0.032          20000  0.560  0.033    0.060     1.72    0.53

Regression Imputation vs. Listwise Deletion

Regression Imputation: Problems

  • SEs are typically too optimistic: every missing X is deterministically set to its predicted value
  • Multiple imputation could add back the right amount of noise and uncertainty.
  • Remember: the goal is unbiased estimates of both \(\beta_X\) and \(SE(\beta_X)\)

Deterministic Imputation ≠ More Power

  • Same line, more dots: Single imputation pins every filled \(X\) to the \(E\)-based guess, so the points stack on one line and add almost no new information about the slope.
  • Spread, not headcount, drives SEs: Because the spread of \(X\) at each level of \(E\) stays tiny, the slope’s standard error barely budges even though more rows show up.
  • False confidence risk: Complete cases can look “precise” simply because they ignore how unsure we are about the missing nutrition scores.

Where Do We Go From Here?

  • We need to put believable spread back into the missing rows rather than printing the same predicted value over and over.
  • Multiple imputation helps mitigate this problem

Multiple Imputation

  • Treat the missing nutrition scores as random draws from a model that predicts \(X\) using the information we do have (here, \(E\)).
  • Do that filling process multiple times, fit \(Y \sim X + E\) in each completed data set, and then average the estimates using Rubin’s rules.
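A minimal workflow using the mice package (assumptions: mice is installed, and the `m = 20` and `method = "norm"` choices are illustrative, not the deck's exact settings):

```r
library(mice)

# Simulated data with MCAR-deleted X, for illustration only
set.seed(7)
n <- 2000
E <- rbinom(n, 1, 0.5)
X <- 1.5 * E + rnorm(n)
Y <- 2 + 0.5 * X + E + rnorm(n, sd = 2)
dat <- data.frame(Y, X = ifelse(rbinom(n, 1, 0.3) == 1, NA_real_, X), E)

imp    <- mice(dat, m = 20, method = "norm", printFlag = FALSE)  # 20 completed data sets
fits   <- with(imp, lm(Y ~ X + E))                               # analysis model in each
pooled <- pool(fits)                                             # combine via Rubin's rules
summary(pooled)
```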

Multiple Imputation: Simulation Results

Scenario                 Missing  N (listwise)  β_X (listwise)  SE (listwise)  β_X (MI)  SE (MI)  β_X bias (MI)
Full data                0%       20000         0.497           0.014          0.497     0.014    -0.003
10% missing, E-driven    11%      17898         0.496           0.015          0.502     0.015    0.002
10% missing, X-driven    10%      18018         0.493           0.017          0.494     0.017    -0.006
30% missing, E-driven    30%      14020         0.500           0.017          0.495     0.017    -0.005
30% missing, X-driven    30%      13971         0.501           0.021          0.502     0.023    0.002
60% missing, E-driven    59%      8121          0.489           0.022          0.462     0.027    -0.038
60% missing, X-driven    60%      8000          0.560           0.032          0.553     0.028    0.053

Multiple Imputation vs. Listwise Deletion

Why Multiple Imputation Helps

  • Adds plausible wiggle: MI draws \(X^* = f(E) + \varepsilon\), where \(f(E)\) is the best guess for nutrition given \(E\) and \(\varepsilon\) is a random residual sampled from the estimated spread. That restores the variation we would have seen if \(X\) were observed.
  • Honest uncertainty: Rubin’s imputation combination rules blend the within- and between-imputation variance, so SEs stay truthful even when missingness is heavy.
  • Bias correction: Because each draw reflects how \(X\) and \(E\) relate, the averaged estimate lands near the true \(\beta_X\).
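In symbols, Rubin's rules for a scalar coefficient across \(m\) imputations:

```latex
\bar{\beta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\beta}_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{\beta}_i - \bar{\beta}\right)^2, \qquad
T = \bar{W} + \left(1 + \frac{1}{m}\right) B
```

where \(\bar{W}\) is the average within-imputation variance, \(B\) is the between-imputation variance, and the square root of the total variance \(T\) gives the pooled SE.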

Multiple Imputation: Takeaways

  • MI lands on the true \(\beta_X\) even in the harsh scenario where values of X drive missingness.
  • The combined SEs stay honest because Rubin’s rules keep the run-to-run variation in the mix.
  • Next session we’ll dig into more details on how MI works and how to implement it yourself.

MNAR: Missing Not At Random

MNAR & Imputation

  • Multiple imputation is not a magic bullet
  • If you lack covariates rich enough to accurately model p(X missing), no missing data strategy can save your analysis.
  • Let’s finish by examining such a scenario

MNAR DAG: Hidden Drivers

MNAR Stress Test Setup

  • Nutrition, income, provider quality, family stability, and class are all correlated.
  • We only observe education and insurance, so any missingness tied to the other pieces is effectively MNAR for us.
  • Let’s compare listwise deletion vs. MI when we have a complex causal graph with unobserved variables

MNAR Simulation Details

  • Simulated 2,000 mothers with seven covariates; truth: \(\beta_X = 0.50\) in \(Y \sim X + \text{education} + \text{insurance}\).
  • Missing nutrition is more likely for low income, low provider quality, fragile family, lower class position mothers.
  • Analysts only get \(Y\), education, insurance, and nutrition (with missingness).

MNAR Table 1: Who’s in Each Sample?

  • We can see here that, by using MI, our summary statistics look more accurate
  • The MAR illustration showed how listwise deletion recovered accurate model estimates
  • However, an accurate Table 1 of summary statistics also benefits from imputation!
Sample     Mean X  SD X  Mean educ  Mean insurance  Mean Y  SD Y
Full data  0.00    1.00  0.00       0.00            3.03    2.10
Listwise   0.28    0.97  0.33       0.29            3.45    2.04
MI (avg)   0.13    0.98  0.00       0.00            3.03    2.10

MNAR Model Specs

  • Truth (omniscient): Y ~ X + education + insurance + income + provider + family + class.
  • What we can actually fit: \(Y \sim X + \text{education} + \text{insurance}\) on whatever \(X\) values we have.
  • We compare three versions of that observed model: full data (for reference), listwise deletion, and MI.

MNAR Results: Listwise vs. MI

  • Omniscient model (with hidden drivers) nails \(\beta_X\), as expected.
  • Even with full nutrition but no hidden drivers, we drift because the omitted variables matter.
  • Once nutrition goes missing, both listwise and MI lean on education/insurance; MI can actually look worse than listwise deletion!
Method                           β_X    SE     Bias
Omniscient (X + hidden drivers)  0.484  0.052  -0.016
Observed, no missing             0.563  0.050  0.063
Listwise deletion                0.564  0.069  0.064
Multiple imputation              0.585  0.067  0.085

MNAR: No Easy Fix

  • Both listwise deletion and MI miss the truth because neither sees the hidden drivers of missingness.
  • MI can even overshoot when missingness leans on factors we never measure—its imputed \(X\) just mirrors education/insurance again.
  • Only extra information (new variables, external data, or sensitivity analyses) can break out of this corner.

Conclusion

Takeaways

  • Handling missing data well is all about good causal inference, on both Y and your missing Xs
  • No missing data strategy can fix an impoverished data set
  • You need to spend a good deal of time thinking about your causal graphs/DAGs before you implement imputation strategies
  • Don’t be afraid of simulating your specific data; you can’t observe your missing Xs, but you can simulate scenarios that tell you what would happen to your results under various conditions.

Thank You!

  • Erik Westlund
  • Johns Hopkins Biostatistics Center
  • ewestlund@jhu.edu